Abstract: Image search is currently a popular service; however, the scarcity of available labels may hurt the
performance of most commercial search engines built on the traditional vector space model. To bridge the semantic gap
between text labels and text queries, we propose a simple approach that extends existing labels using a thesaurus. Unlike
the naive approach, in which synonyms of the tokens contained in a label are simply added, we employ three metrics to
filter out inappropriate candidate labels (combinations of synonyms): user log, semantic vector and context vector. The
experimental results indicate that the proposed method performs well in terms of both effectiveness and efficiency.
I. INTRODUCTION

The number of search requests for multimedia resources, such as images, video or music, has been increasing on most
commercial search engines. For example, based on our own data, the PVs (page views) contributed by image search account
for over 10% of our total traffic on Roboo© (i.roboo.com, a popular mobile search engine in China). The ratio would be
much higher if we counted all non-text resource retrieval services. This is consistent with what we understand about the
users of mobile search: they prefer entertainment resources during their non-office time [11]. In the remaining text, we
focus the discussion on image annotation, but what is proposed here can be applied to video or music as well.

However, these multimedia resources pose one critical challenge: their labels are normally quite short compared with
Web documents. One study on our own image repository indicates that the average length of image labels is 5.18
characters. Besides, the average length of image queries is about 3.52 [11], which further worsens the performance of
image search, since there is obviously a gap between image labels and textual queries. Since manually annotating images
is a very tedious and expensive task, image auto-annotation has become a hot research topic in recent years.

Although many previous works have been proposed using computer vision and machine learning techniques, image
annotation is still far from practical [12]. One reason is that it is still unclear how to model the semantic concepts
effectively and efficiently.

The other reason is the lack of training data, and hence the semantic gap cannot be effectively bridged [12].
Therefore, until now, almost all known commercial image search engines in operation, including ours, depend on the
labels attached to images. The direct results of this search model are:

• Missed hits are unavoidable due to the short labels available;

• One image can only serve very few queries. For example, the image shown in Fig. 1 has a label meaning "sexy auto
model". Under the current VSM-based (Vector Space Model) search model, the image will be retrieved if and only if the
query is "sexy", "auto model", or "sexy auto model". Though "sexy auto model" normally implies "pretty girl" (a quite
popular query, as indicated by our log), this example image will not appear in the results of the query "pretty girl"
because it lacks the label "pretty girl".




Fig. 1. One image example with the label "sexy auto model"



Generally, these two aspects together constitute the so-called semantic gap between target resources and queries [...]
shown as Path 3 in Fig. 2. Ideally, we hope that the query can be matched directly by some resources, like Path 3, but we
are not always so fortunate in the real world. There are two possible options to solve the problem:

• Query extension (Path 1) [1,8];

• Label extension (Path 2).

Fig. 2. The path connecting query and [...]



In this paper, we choose Path 2 since it is believed to be a better approach than Path 1 based on the following
considerations:

• [...] work is normally done offline, so efficiency is [...] choose candidate approaches. Normally, better methods
require more computing [...];

• Given the zero-hit queries, we may purposely "attach" them to some resources via an appropriate algorithm;

• By preparing the extended labels offline first, the online response speed is not affected;

• No matter how well a query extension method works, it is still built on the labels; rich labels serve as the basis
for [...];

• Extended labels may evolve, being added to or removed from an image, with time to meet specific application
requirements.



In this paper, label extension is based on a thesaurus. However, unlike the naive approach, we propose three
metrics to filter out inappropriate labels. In Section 2, the proposed method is explained in detail. In Section 3,
experimental results are presented to demonstrate its effectiveness and efficiency. A short conclusion and future work
are given in Section 4.



II. LABEL EXTENSION 

This section describes in detail how we extend labels. The overall procedure is shown in Fig. 3, and each step is
discussed in the following sub-sections.

2.1 Word Segmentation (Tokenization) 

In English, words are separated by spaces. In Chinese, however, one word may contain one or more Chinese
characters, and all characters are concatenated in a sentence. Therefore, Chinese text has to be segmented, or
tokenized, to extract words before it can be analyzed or indexed [2,3]. In Fig. 3, the input "Label" refers to the
original label available for an image, and the "Word Segmentation" module outputs a series of tokens in the same relative
order as in the original label string. Repeated tokens are left unchanged to keep the original semantic meaning. These
tokens serve as the input for the next step, "Get candidates using synonyms".
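
The segmentation step can be illustrated with a minimal sketch. The paper does not say which segmenter is used, so the greedy forward maximum matching below, the function name `segment`, the toy vocabulary and the example label are all assumptions made purely for illustration.

```python
def segment(label, vocabulary, max_word_len=4):
    """Greedy forward maximum matching (an assumed, simple segmenter).

    At each position the longest dictionary word is taken; if no dictionary
    word matches, a single character is emitted as its own token.
    """
    tokens = []
    pos = 0
    while pos < len(label):
        longest = min(max_word_len, len(label) - pos)
        for length in range(longest, 0, -1):
            piece = label[pos:pos + length]
            if length == 1 or piece in vocabulary:
                tokens.append(piece)
                pos += length
                break
    return tokens


# Hypothetical vocabulary and label ("beautiful scenery picture").
toy_vocabulary = {"美丽", "风景", "图片"}
tokens = segment("美丽风景图片", toy_vocabulary)
print(tokens)
```

Tokens are returned in their original order and repeated tokens are kept, matching the behavior described above.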

2.2 Collect Candidates Using Thesaurus 

The thesaurus [4,7] plays an important role here because we have to select the candidates very carefully to keep the
original semantic meaning unchanged. Synonyms are believed to be ideal candidates, considering that a thesaurus is
normally constructed with great care. The appropriateness of the synonyms is a solid basis for the success of our
method. In fact, thesaurus techniques are widely applied in information processing and information retrieval, such as
detecting text similarity over short passages [5], query expansion using synonymous terms [6], etc.
Fig. 3. Overall procedure of label extension proposed by us

Here, given the sequence of tokens, the replacement [...] independently. The candidate construction procedure is as
follows:

1. For a label $L$, its corresponding token sequence, as output by the word segmentation component, is denoted as
$TS_L = \{T_1, T_2, \ldots, T_k\}$, where $T_j$ represents one token.

2. For each $T_j$, we get its synonyms by reference to an open thesaurus edited and maintained by [7]; all the
synonyms plus $T_j$ itself constitute one set denoted as $Syn(T_j)$;

3. Selecting one instance from each $Syn(T_j)$ and concatenating them together, we get one candidate label. All
possible candidates are denoted as $CL = \{CL_i = C_1 C_2 \ldots C_k \mid C_j \in Syn(T_j),\ j = 1 \ldots k;\ i = 1 \ldots L\}$,
where the number of candidates is $L = \prod_{j=1}^{k} |Syn(T_j)|$, and $|Syn(T_j)|$ is the cardinality of the set $Syn(T_j)$.
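
The three steps above amount to a Cartesian product over the per-token synonym sets. The following is a minimal sketch under stated assumptions: the function name `build_candidates`, the dictionary-based thesaurus and the placeholder tokens are hypothetical, and the real thesaurus of [7] is not reproduced here.

```python
from itertools import product


def build_candidates(tokens, thesaurus):
    """Return every candidate label formed by picking one entry per Syn(T_j)."""
    syn_sets = []
    for token in tokens:
        # Syn(T_j): the token itself plus its synonyms, duplicates removed,
        # original order preserved.
        syn_sets.append(list(dict.fromkeys([token] + thesaurus.get(token, []))))
    # One choice from each set, concatenated, gives one candidate label.
    return ["".join(choice) for choice in product(*syn_sets)]


# Placeholder tokens standing in for real words.
toy_thesaurus = {"A": ["A1", "A2"], "B": ["B1"]}
candidates = build_candidates(["A", "B"], toy_thesaurus)
print(len(candidates), candidates)
```

With two synonyms for the first token and one for the second, (2+1) × (1+1) = 6 candidates are produced, and the first of them is the original label itself.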



With this method, many more possible labels are generated. For example, assume a label that is first segmented
into two tokens. By referring to the thesaurus, we find five synonyms for the first token and four synonyms for the
second. Then we may have 30 labels ((5+1) × (4+1) = 30). However, some labels produced by combining the candidate
synonyms may be meaningless, even though each synonym token itself is acceptable. For example, one possible extension of
the label "my girl", namely "my female", may not be appropriate although "female" may be an acceptable synonym of "girl"
(or "woman"). This example shows that deciding whether one synonym alone is appropriate may be difficult or even
impossible. The more practical and reasonable approach is to evaluate each candidate label as a whole, though this
requires more computation. How this can be achieved is discussed in the following three sub-sections.

2.3 Evaluation Using User Log 

The user log records the [...] and behaviors of users [...] queries and the corresponding frequencies, and it allows us
to evaluate the appropriateness of our candidate labels. The evaluation proceeds as follows:

1. Let $CL = \{CL_i = C_1 C_2 \ldots C_k \mid C_j \in Syn(T_j),\ j = 1 \ldots k;\ i = 1 \ldots L\}$ represent all the
candidate labels output at the end of the last step, and let $Log = \{(q_i, f_i) \mid i = 1 \ldots N\}$ represent the log
corpus, where $q_i$ is a query and $f_i$ is the frequency with which it was issued;

2. Given the log data and the generated candidate labels, we filter out those candidates not found in the log, leaving
$CL' = \{CL_i \mid CL_i \in CL \text{ and } CL_i \in Log\}$. In the current version, we require a COMPLETE match when
deciding whether to filter a candidate out. This filtering is also required by Step 4 and Step 5, as indicated in
Fig. 3;

3. To each $CL_i \in CL'$, a score is assigned, denoted as $S_{Log}(CL_i)$, with $S_{Log}(CL_i) = f_i / \sum_j f_j$,
where $f_i$ here is the logged frequency of the query matching $CL_i$.

Therefore, $S_{Log}(CL_i)$ is the result of Step 3 in Fig. 3. Based on our experiments, this is effective in filtering
out some inappropriate candidates, and it gives a reasonable score to each accepted candidate under the heuristic that
the more frequently a candidate is queried, the more important it is. However, there is still "noise" in the remaining
candidates, and further filtering is needed.

One additional advantage of this method is that, by appending more labels, we may "create" image results for some
queries that originally had no matched results.
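
The log-based filter and score can be sketched as follows. The text does not state the range of the sum in the denominator; this sketch assumes it runs over the candidates that survive the filter. The function name `score_by_log` and the toy log are hypothetical.

```python
def score_by_log(candidates, query_log):
    """Keep candidates that exactly match a logged query and score them.

    query_log maps a query string to its frequency. The denominator is
    assumed to be the total frequency of the surviving candidates.
    """
    kept = {c: query_log[c] for c in candidates if c in query_log}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {c: freq / total for c, freq in kept.items()}


# Placeholder candidates and a toy log of (query, frequency) pairs.
toy_log = {"AB": 30, "A1B": 10, "XY": 60}
s_log = score_by_log(["AB", "A1B", "A2B1"], toy_log)
print(s_log)
```

Here "A2B1" is dropped because it never appears as a query, and the two surviving candidates are scored in proportion to how often they were queried.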

2.4 Evaluation Using Semantic Vector 

Given a set of candidate labels, it is hard, or even impossible, to simply tag each one as "right" or "wrong". For
instance, "woman" has synonyms like "female" or "wife". Though both are acceptable, it does NOT mean that they are
"equal" in meaning compared with the original label "woman". In this section, we propose to rank the remaining
candidates based on their semantic similarity to the original label.

Two labels can have the same meaning even though they share no common token, which results in a zero similarity
score in the traditional vector space. To address this problem, a more reasonable and reliable measure of the closeness
between labels is desired, one that uses more comprehensive background information for reference instead of simple
term-wise occurrences. Fortunately, the Web provides a potentially rich source of data that may be utilized to address
this problem, and a modern search engine helps us leverage the large volume of pages on the Web to determine the greater
context of a query efficiently and effectively.

Our solution to this problem is quite simple, and it is based on the intuition that queries which are similar in
concept are expected to have similar contexts. For each query, we rely on the search engine to retrieve a group of
related, ranked documents, from which a feature vector is extracted. Such a feature vector contains the words that tend
to co-occur with the original query, so it can be viewed as the context of the query, providing more semantic background
information. The comparison of queries then becomes the comparison of their corresponding feature (or context) vectors.
Similar applications of this method have been used to determine the similarity between other objects on the Web [8,10].

Here, a candidate label, or a token contained in it, is viewed as a query and submitted to a search engine to
retrieve relevant documents, from which the so-called "semantic background" knowledge can be extracted and referred to
later. The procedure is formalized as follows. Let $q$ represent a query; we get its feature vector (FV), denoted as
$FV(q)$, with the following sequential steps:

1. Issue the query $q$ to a search engine; for now, let us ignore the actual algorithms of the search engine and assume
that the approach is generalizable to any engine that provides "reasonable" results;

2. Let $D(q)$ be the set of retrieved documents. In practice, we only keep the top $k$ documents, assuming that they
contain enough information, so $D(q) = \{d_1, d_2, \ldots, d_k\}$;

3. Compute the TF-IDF term vector $tv_i$ for each document $d_i \in D(q)$, with each element given by

$tv_i(j) = tf_{i,j} \times \log(N / df_j)$

where $tf_{i,j}$ is the frequency of the $j$th term in $d_i$, $N$ is the total number of documents available behind the
search engine, and $df_j$ is the total number of documents that contain the $j$th term. TF-IDF is widely applied in the
information retrieval (IR) community and has been proved to work well in many applications, including the current
approach;

4. Sum up $tv_i$, $i = 1 \ldots k$, to get a single vector. Here, for the same term, its IDF, i.e. $\log(N / df_j)$, is
identical in the different $tv_i$, so the sum of the corresponding weights can be calculated;

5. Normalize and rank the features of the vector with respect to their weights (TF-IDF), and truncate the vector to
its $M$ highest weighted terms. What is left is the target $FV(q)$.
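
Steps 3 to 5 above can be sketched as follows, assuming that the retrieved documents are already tokenized and that the collection statistics ($N$ and $df_j$) are supplied by the caller. The text does not say which normalization is used, so scaling to unit L2 length is an assumption, as are the function name `feature_vector` and the toy data.

```python
import math
from collections import Counter


def feature_vector(documents, doc_freq, num_docs, top_m):
    """Sum per-document TF-IDF vectors, normalize, keep the top_m terms."""
    summed = Counter()
    for doc_tokens in documents:
        for term, tf in Counter(doc_tokens).items():
            # tv_i(j) = tf_{i,j} * log(N / df_j), accumulated over documents.
            summed[term] += tf * math.log(num_docs / doc_freq[term])
    norm = math.sqrt(sum(w * w for w in summed.values()))
    if norm == 0:
        return {}
    ranked = sorted(summed.items(), key=lambda item: item[1], reverse=True)
    # Assumed normalization: unit L2 length, applied before truncation.
    return {term: weight / norm for term, weight in ranked[:top_m]}


# Toy stand-ins for the top-k retrieved documents and collection statistics.
docs = [["car", "show", "model"], ["model", "photo"]]
toy_doc_freq = {"car": 10, "show": 20, "model": 5, "photo": 50}
fv = feature_vector(docs, toy_doc_freq, num_docs=100, top_m=2)
print(fv)
```

In this toy run the two highest-weighted terms are "model" and "car", in that order, so only they remain in the truncated vector.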

After we get the feature vector for each query of interest, the measure of semantic similarity between two queries is
defined as

$Sim(q_i, q_j) = FV(q_i) \cdot FV(q_j)$

In our label extension, we evaluate the similarity between the original label $L$ and a candidate label $CL_i$ from two
aspects:

1. One is the overall view, with a score denoted as $Score_{All}$:

$Score_{All} = Sim(L, CL_i)$

2. The other is the partial view, with the corresponding score denoted as $Score_{Partial}$:

$Score_{Partial} = \max_j \{ Sim(T_j^{L}, T_j^{CL_i}) \}$

where $T_j^{L}$ refers to the $j$th token in the original label $L$, and $T_j^{CL_i}$ refers to the $j$th token in the
candidate label $CL_i$. The candidate construction algorithm ensures that $L$ and $CL_i$ have the same number of tokens,
so the calculation is well defined;

3. Given a candidate label $CL_i$, its final score is determined by

$S_{SV}(CL_i) = \max(Score_{All}, Score_{Partial})$

$S_{SV}(CL_i)$, therefore, is the output of Step 4 in Fig. 3.
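
The scoring above can be sketched with sparse feature vectors stored as dictionaries. The vectors here are toy values rather than the output of a real search engine, and the function names `dot` and `semantic_score` are hypothetical.

```python
def dot(u, v):
    """Dot product of two sparse vectors stored as term -> weight dicts."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


def semantic_score(label_fv, cand_fv, label_token_fvs, cand_token_fvs):
    """Return max(Score_All, Score_Partial) for one candidate label."""
    score_all = dot(label_fv, cand_fv)
    # Tokens are compared position by position; both labels have the same
    # number of tokens by construction.
    score_partial = max(
        dot(a, b) for a, b in zip(label_token_fvs, cand_token_fvs)
    )
    return max(score_all, score_partial)


# Toy feature vectors for a two-token label and one candidate.
label_fv = {"x": 0.6, "y": 0.8}
cand_fv = {"y": 0.25}
label_token_fvs = [{"p": 0.6, "q": 0.8}, {"r": 1.0}]
cand_token_fvs = [{"q": 0.5}, {"s": 1.0}]
s_sv = semantic_score(label_fv, cand_fv, label_token_fvs, cand_token_fvs)
print(s_sv)
```

In this toy case the overall score is 0.2 while the best token-level score is 0.4, so the partial view determines the final score.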

2.5 Evaluation Using Context Vector 

So far, two dimensions have been introduced to measure whether a candidate label is acceptable: the query log and
semantic information. However, it is noticed that, due to the nature of our index repository and of the queries received
(mostly about entertainment), the selection procedure is more or less biased. To weaken this influence so that more
serious or formal labels are not filtered out, one additional dimension called the Context Vector (CV), based on the
People's Daily news corpus [9], is introduced. How a context vector is obtained and how the similarity score based on
context vectors is determined are formalized as follows:

1. Given a label $L$, we find all the "contexts" containing $L$ in the corpus. In the current version, we limit the
context length to 1, that is, we only extract the triples $\{(T_1, L, T_2)\}$;

2. Given all the contexts $\{(T_1, L, T_2)\}$, we count the occurrences of each token $T_i$ that appears in the triples
above, ignoring whether it appears before or after $L$. This gives the so-called context vector of $L$,
$CV(L) = \{tf_1, tf_2, \ldots, tf_m\}$, where $tf_i$ refers to a token's frequency and the entries are ranked in
decreasing order of frequency;

3. The similarity between the original label $L$ and a candidate label $CL_i$ is
$Sim(L, CL_i) = CV(L) \cdot CV(CL_i)$. For easy reference, this score is denoted as $S_{CV}(CL_i)$.
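
The context vector construction can be sketched as follows, assuming a pre-tokenized corpus and a label matched as a contiguous token sequence. The tiny corpus is invented and does not come from the People's Daily corpus. Dividing the dot product by the vector lengths (cosine similarity) is also an assumption, since the text only states that scores are normalized before being combined.

```python
import math
from collections import Counter


def context_vector(label_tokens, corpus_sentences):
    """Count the tokens immediately before or after each label occurrence."""
    n = len(label_tokens)
    counts = Counter()
    for sentence in corpus_sentences:
        for start in range(len(sentence) - n + 1):
            if sentence[start:start + n] == label_tokens:
                if start > 0:
                    counts[sentence[start - 1]] += 1  # left neighbor
                if start + n < len(sentence):
                    counts[sentence[start + n]] += 1  # right neighbor
    return counts


def cv_similarity(cv_a, cv_b):
    """Dot product of two context vectors, scaled to [0, 1] (assumed cosine)."""
    dot = sum(freq * cv_b[token] for token, freq in cv_a.items())
    norm_a = math.sqrt(sum(f * f for f in cv_a.values()))
    norm_b = math.sqrt(sum(f * f for f in cv_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented, pre-tokenized sentences standing in for a news corpus.
toy_corpus = [
    ["the", "happy", "child", "smiles"],
    ["a", "happy", "child", "plays"],
    ["the", "joyful", "child", "sings"],
]
cv_happy = context_vector(["happy"], toy_corpus)
cv_joyful = context_vector(["joyful"], toy_corpus)
print(cv_happy.most_common(), cv_similarity(cv_happy, cv_joyful))
```

`Counter.most_common()` returns the entries in decreasing order of frequency, which corresponds to the ranking described in step 2.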

So, the score by Context Vector is similar to that by Semantic Vector: both depend on a third-party corpus to gain
a deeper understanding of the labels of interest. The only difference lies in how the vectors for comparison are
constructed. Table 1 lists two examples in which the similarity scores based on the two different vectors are given for
different pairs of labels, and it indeed indicates that the two scores are complementary to some extent.



Table 1. Two examples of Semantic Vector vs. Context Vector

Example pair                                 Similarity based on    Similarity based on
                                             Semantic Vector        Context Vector

Two words that both refer to happiness       0.433                  0.00

"pretty girl" vs. "stunner"                  0.00                   0.268



2.6 Reach the Output 

So far, we have introduced how to construct candidates for a given label, and how to rank the candidates based
on different scores, $S_{Log}$, $S_{SV}$ and $S_{CV}$. To get the final score of a candidate label, the three scores need
to be combined in some manner. In the current version, a weight is assigned to each score, and they are added linearly as:

$Score = \alpha_1 \times S_{Log} + \alpha_2 \times S_{SV} + \alpha_3 \times S_{CV}$

To make them comparable and addable, each individual score is normalized before being summed. Based on observation
from many experiments and manual study, $\alpha_1 = 0.8$, $\alpha_2 = 1.3$ and $\alpha_3 = 1.0$ are chosen, since they
produce satisfactory results.
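
The linear combination can be sketched as follows, using the weights reported above. The normalization scheme is not specified in the text; dividing each score by the maximum over the candidates is an assumption, and the function names and toy scores are hypothetical.

```python
def normalize(scores):
    """Scale scores so that the best candidate gets 1.0 (assumed scheme)."""
    top = max(scores.values(), default=0.0)
    return {c: (s / top if top > 0 else 0.0) for c, s in scores.items()}


def combine(s_log, s_sv, s_cv, weights=(0.8, 1.3, 1.0)):
    """Weighted linear sum of the three normalized scores per candidate."""
    n_log, n_sv, n_cv = normalize(s_log), normalize(s_sv), normalize(s_cv)
    a1, a2, a3 = weights
    # Only candidates that survived the log filter are scored.
    return {
        c: a1 * n_log[c] + a2 * n_sv.get(c, 0.0) + a3 * n_cv.get(c, 0.0)
        for c in n_log
    }


# Toy scores for two placeholder candidates.
final = combine(
    s_log={"c1": 0.75, "c2": 0.25},
    s_sv={"c1": 0.2, "c2": 0.4},
    s_cv={"c1": 0.0, "c2": 0.3},
)
print(final)
```

In this toy case "c2" ends up ranked above "c1" even though "c1" is queried more often, because the semantic and context scores carry more combined weight than the log score.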

III. EXPERIMENTAL STUDY 

In our experiment, for each label, we apply the procedure shown in Fig. 3 to get candidate labels. Summary
information is presented in Table 2:
Table 2. Summary information of experiments

Total # of images in test repository                                                  1,957,090

Total # of different labels                                                           270,192

Ratio of image # to label #                                                           7:1

Average length of label                                                               5.18

Total time to process all images                                                      3,296 (sec.)

Total # of labels extended finally (with at least one new candidate label output)     1,992

Total # of images influenced, i.e. with new label generated                           1,325,816

From Table 2, we can conclude that:

1. Many images share the same labels (the ratio of image # to label # is 7:1);

2. The labels available are short;

3. The running performance is acceptable: it takes less than 1 hour to process about two million images on a
common PC (CPU 1.6 GHz; 2.0 GB RAM);

4. Although only 1,992 labels are actually extended, about 1.3 million images (over two thirds) are influenced,
which further confirms the first point, i.e. many images have the same labels. These labels are also frequently
queried, which reflects a common preference among people, whether image annotators or query submitters.

The 1,992 extended labels were checked manually by four editors, and 1,405 labels were judged appropriate, an
acceptance ratio of about 70.5% (1405/1992). Two primary causes may explain the inappropriate extensions:

1. The quality of the thesaurus relative to our application. Some synonyms may be acceptable in other fields, but not
in our mobile search application. It may be worth the effort to build a thesaurus customized for our field in the long
run;

2. Polysemy. Although we carefully evaluate the appropriateness of a candidate based on query log, semantic vector and
context vector, extra effort on further disambiguation is still desired.
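
The figures reported in this section can be cross-checked with simple arithmetic. The short script below only restates numbers from Table 2 and the manual check; the count of influenced images is read as 1,325,816, which is an interpretation of the table entry consistent with the "about 1.3 million" stated in the text.

```python
images = 1_957_090       # images in the test repository
labels = 270_192         # distinct labels
seconds = 3_296          # total processing time
extended = 1_992         # labels with at least one new candidate
accepted = 1_405         # extended labels judged appropriate by editors
influenced = 1_325_816   # images that received a new label (assumed reading)

images_per_label = images / labels
minutes = seconds / 60
influenced_share = influenced / images
acceptance_ratio = accepted / extended

print(f"images per label: {images_per_label:.2f}")
print(f"processing time:  {minutes:.1f} minutes")
print(f"influenced share: {influenced_share:.1%}")
print(f"acceptance ratio: {acceptance_ratio:.1%}")
```

The results agree with the statements above: roughly 7 images per label, under one hour of processing, more than two thirds of the images influenced, and an acceptance ratio of about 70.5%.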

IV. CONCLUSION 

Image search is a popular online service. However, the scarcity of annotations is widely known as a challenge. In
this paper, a simple architecture is proposed to enrich the labels available for an image. First, we construct a series
of candidates for a given label based on a simple rule and a thesaurus. Then, three different dimensions of information
are employed to rank the candidates: user log, semantic similarity and context similarity. The experiments indicate that
over 70% of the generated labels are regarded as reasonable and acceptable. Although our discussion is based on images,
the proposed method can be applied to other multimedia resources suffering from a similar problem, e.g. video or themes
to install on mobile phones. There is still much room to improve the performance of our work. For example, we admit that
filtering candidates by whether they appear in the query log is somewhat too strict. Partial containment relations or
similarity between queries and labels may bring us more labels. Other information or metrics may also be introduced
into the selection in applications.